Loading necessary libraries

library(gprofiler2)
library(tidyverse)
library(readxl)

Check your working directory

  • Paths provided within this .rmd assume you are working within the Hackathon19_visualise_ontology project directory. Provided you have opened Hackathon19_visualise_ontology as a project this should already be the case.
  • You can check by doing the following:
getwd()
## [1] "/home/rreynolds/misc_projects/Hackathon19_visualise_ontology"

Running gProfiler2

  • Input genes are sourced from the latest PD GWAS.
# import PD GWAS gene list
geneList <- 
  readxl::read_excel(path = "./data/TableS2_Detailed_summary_statistics_nominated_risk_variants.xlsx") %>% 
  # Extract "Nearest Gene" column as vector
  .[["Nearest Gene"]] %>% 
  # Remove NAs
  na.omit() %>%
  # Convert to character vector
  as.character() %>% 
  # Keep only unique genes
  unique()

# running gprofiler
gprofilerOutput <- gprofiler2::gost(geneList, 
                               organism = "hsapiens", # set the organism 
                               correction_method = c("gSCS"), # select a correction method (gSCS reccomended in gost documentation)
                               domain_scope = c("annotated"), # select whether to restrict the background set
                               sources = c("GO"), # databases would you like to search for pathways & processes (others: "KEGG","REAC","WP")
                               significant = TRUE, # when TRUE displays only significant results (p<0.05)
                               evcodes = FALSE, # when TRUE displays the IDs of the genes calling each pathway
                               ordered_query = FALSE) # when TRUE the input list is ordered in some meaningful way
  • The background list (domain_scope argument) is an important choice, as it should represent the population of genes your list was sampled from. E.g. If you were to perform a differential gene expression experiment in brain samples from control and PD, your background list would be all the expressed genes you tested.

What do the results mean?

gprofilerOutput$result %>% 
  DT::datatable(rownames = FALSE,
                options = list(scrollX = TRUE),
                class = 'white-space: nowrap')
  • gprofiler outputs a number of columns, a few of which are not entirely self-explanatory, including:
    • precision: this is the proportion of genes in the input gene list that can be assigned this term i.e. overlap.size/query.size
    • recall: this is the proportion of genes in the input list that overlap with the term i.e. overlap.size/term.size
    • parents: GO terms belong to families of terms. This column will provide the parent terms to the GO term found enriched.

How to visualise this information?

  • There are many ways to visualise these results, with a few examples given below.
  • How do you think this information would be best displayed? This is what we’re going to do over the next few days.

Visualising the results as a barplot

source("./R/gen_gprofiler_barplot.R")

gen_gprofiler_barplot(DF = gprofilerOutput$result, 
                      num_categories = 10,
                      plot_title = "")

Visualising the results with gprofiler’s in-built visualisation

gprofiler2::gostplot(gprofilerOutput)

Visualising as a network?

  • Nodes would represent GO terms and edges the genes shared between these GO terms
  • Potential packages:

Session info

  • It is considered good practice to always your session info in the html output of an R markdown.
  • Session info will depend on the RStudio you are using, your operating system and the packages you load. Thus, when you knit the final product of your code, it may not look like it does in this version of the html output.
sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.6 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] plyr_1.8.4        data.table_1.12.2 gridExtra_2.3    
##  [4] scales_1.0.0      readxl_1.3.1      forcats_0.4.0    
##  [7] stringr_1.4.0     dplyr_0.8.3       purrr_0.3.3      
## [10] readr_1.3.1       tidyr_1.0.0       tibble_2.1.3     
## [13] ggplot2_3.2.1     tidyverse_1.2.1   gprofiler2_0.1.7 
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.2        lubridate_1.7.4   lattice_0.20-38  
##  [4] assertthat_0.2.1  zeallot_0.1.0     digest_0.6.21    
##  [7] mime_0.7          R6_2.4.0          cellranger_1.1.0 
## [10] backports_1.1.4   evaluate_0.14     httr_1.4.1       
## [13] pillar_1.4.2      rlang_0.4.0       lazyeval_0.2.2   
## [16] rstudioapi_0.10   DT_0.9            rmarkdown_1.18   
## [19] labeling_0.3      htmlwidgets_1.3   RCurl_1.95-4.12  
## [22] munsell_0.5.0     shiny_1.3.2       broom_0.5.2      
## [25] compiler_3.6.1    httpuv_1.5.2      modelr_0.1.5     
## [28] xfun_0.11         pkgconfig_2.0.3   htmltools_0.3.6  
## [31] tidyselect_0.2.5  viridisLite_0.3.0 crayon_1.3.4     
## [34] withr_2.1.2       later_0.8.0       bitops_1.0-6     
## [37] grid_3.6.1        nlme_3.1-141      jsonlite_1.6     
## [40] xtable_1.8-4      gtable_0.3.0      lifecycle_0.1.0  
## [43] magrittr_1.5      cli_1.1.0         stringi_1.4.3    
## [46] promises_1.0.1    xml2_1.2.2        ellipsis_0.3.0   
## [49] generics_0.0.2    vctrs_0.2.0       tools_3.6.1      
## [52] glue_1.3.1        hms_0.5.1         crosstalk_1.0.0  
## [55] yaml_2.2.0        colorspace_1.4-1  rvest_0.3.4      
## [58] plotly_4.9.1      knitr_1.25        haven_2.1.1